1. Introduction

Have you ever read horror stories when you were young? Have you ever been scared of horror fiction? No matter the answer is yes or no, let us make an analysis of the horror fiction. We will analyze texts from Edgar Allan Poe, Mary Shelley, and HP Lovecraft with the goal to find out whether they are appropriate for children to read.

  1. Preparations

2.1 Load Libraries

We will take a range of libraries for general data wrangling and general visualisation together with more specialised language-processing tools.

packages.used <- c("ggplot2", "dplyr", "tibble", "tidyr",  "stringr", "tidytext", "topicmodels", "wordcloud2", "wordcloud", "ggridges","corrplot")

# check packages that need to be installed.
packages.needed <- setdiff(packages.used, intersect(installed.packages()[,1], packages.used))

# install additional packages
if(length(packages.needed) > 0) {
  install.packages(packages.needed, dependencies = TRUE, repos = 'http://cran.us.r-project.org')
}

library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(wordcloud2)
library(wordcloud)
library(ggridges)
library(corrplot)

source("../libs/multiplot.R")

2.2 Load Data

spooky <- read.csv('../data/spooky.csv', as.is = TRUE)
  1. An overview of the data structure and content
head(spooky)
##        id
## 1 id26305
## 2 id17569
## 3 id11008
## 4 id27763
## 5 id12958
## 6 id22965
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   text
## 1                                                                                                                                                                                                                                              This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
## 2                                                                                                                                                                                                                                                                                                                                                                                                              It never once occurred to me that the fumbling might be a mere mistake.
## 3                                                                                                                                                                                                                                                                             In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.
## 4                                                                                                                                                                                                                                                                       How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.
## 5                                                                                                                                                                                                                                                                                                       Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.
## 6 A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.
##   author
## 1    EAP
## 2    HPL
## 3    EAP
## 4    MWS
## 5    HPL
## 6    MWS
summary(spooky)
##       id                text              author         
##  Length:19579       Length:19579       Length:19579      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
View(spooky) 

we found each sentence is assigned an ID. So we checked whether the ID and text of each row was unique

length(unique(spooky$id)) == nrow(spooky)
## [1] TRUE
length(unique(spooky$text)) == nrow(spooky)
## [1] TRUE

Any missing value?

sum(is.na(spooky))
## [1] 0

Change author name to be a factor variable

spooky$author <- as.factor(spooky$author)
  1. The Tidy text format

4.1 Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. Therefore, we first use the unnest_tokens() function to drop all punctuation and transform all words into lower case.

# Make a table with one word per row and remove `stop words` (i.e. the common words).
spooky_wrd <- unnest_tokens(spooky, word, text)
spooky_wrd <- anti_join(spooky_wrd, stop_words, by = "word")
  1. Word Frequency

5.1 Wordcloud

Now we use package “Wordcloud2” to generate some wordclouds to visualise each author’s work. But these plots can only help us have an overview of author’s work. Then, we will use other methods to see the word frequency in each author’s work in the following parts of this charpter.

5.1.1 Wordcloud for EAP

spooky_wrd_EAP <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, EAP) %>%
  dplyr::filter(!is.na(EAP))%>%
  dplyr::arrange(desc(EAP))

figPath = system.file("examples/octopus.jpg",package = "wordcloud2")
wordcloud2(spooky_wrd_EAP, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")
# words <- count(group_by(spooky_wrd, word))$word
# freqs <- count(group_by(spooky_wrd, word))$n
# words_freqs <- data.frame(words,freqs)
# wordcloud(word, n, max.words = 30, color = "purple4")

5.1.2 Wordcloud for MWS

spooky_wrd_MWS <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, MWS) %>%
  dplyr::filter(!is.na(MWS))%>%
  dplyr::arrange(desc(MWS))

figPath = system.file("examples/white-ghost-hi.jpg",package = "wordcloud2")
wordcloud2(spooky_wrd_MWS, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")

5.1.3 Wordcloud for HPL

spooky_wrd_HPL <- spooky_wrd%>%
  dplyr::count(author,word)%>%
  dplyr::group_by(author)%>%
  tidyr::spread(author, n)%>%
  dplyr::select(word, HPL) %>%
  dplyr::filter(!is.na(HPL))%>%
  dplyr::arrange(desc(HPL))

figPath = system.file("examples/black-bird-hi.png",package = "wordcloud2")
wordcloud2(spooky_wrd_HPL, figPath = figPath, size = 1.5, color = "random-light", backgroundColor = "black")

5.2 In this part, we could have a clear thought of the word frequency of each author’s work. We can find out which words they used most frequently.

# Counts number of times each author used each word.
author_words <- count(group_by(spooky_wrd, word, author))

# Counts number of times each word was used.
all_words    <- rename(count(group_by(spooky_wrd, word)), all = n)

author_words <- left_join(author_words, all_words, by = "word")
author_words <- arrange(author_words, desc(all))
author_words <- ungroup(head(author_words, 81))
  
ggplot(author_words) +
  geom_col(aes(reorder(word, all, FUN = min), n, fill = author)) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

We can already see that some words are almost equally frequent for all authors, such as “time”. In contrast, “love” is clearly more used by Shelley than by Lovecraft. The word “half” is only found in Poe and Lovecraft, but not in Shelley’s work at a notable frequency.

5.3 Author-dependent word frequencies

Lets start to plot the word frequencies (log scale) comparing two authors at a time and see how words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts. While words that are far from the line are words that are found more in one set of texts than another.

As we can see in the plots below, there are some words close to the line but most of the words are around the line showing a difference between the frequencies.

5.3.1 EAP vs. HPL

#We need to spread the author (key) and the word frequency (value) across multiple columns. There might be NAs if it was not uesd by an author
tib.word_freqs_author <- spooky_wrd %>%
  dplyr::count(author, word) %>%
  dplyr::group_by(author) %>%
  dplyr::mutate(word_freq_author = n/sum(n)) %>%
  dplyr::select(-n)

tib.word_freqs_author <- tib.word_freqs_author%>%
  tidyr::spread(author, word_freq_author)

#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)

author_words_EAP_vs_HPL <- tib.word_freqs_author %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))


ggplot(data = author_words_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgard Allan Poe")

5.3.2 EAP vs. MWS

#Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)

author_words_EAP_vs_MWS <- tib.word_freqs_author %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))


ggplot(data = author_words_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgard Allan Poe")

5.3.3 HPL vs. MWS

author_words_HPL_vs_MWS <- tib.word_freqs_author %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))


ggplot(data = author_words_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")

5.4 Correlation – Pearson

Next, we will use Pearson for linearity method to calculate a correlation between the authors. In this way, we can verify the similarity or differences these sets of word frequencies by author.

cor.tib.word_freqs_author <- tib.word_freqs_author[2:4]%>%
  cor(use="complete.obs", method="spearman") %>%
  corrplot(type="lower",
           method="pie",
           diag = F)

There is a correlation of around 0.48 to 0.5 between the different authors.

5.5 TF-IDF

TF stands for term frequency or how often a word appears in a text and it is what is studied above in the word cloud. IDF stands for inverse document frequncy, and it is a way to pay more attention to words that are rare within the entire set of text data that is more sophisticated than simply removing stop words. Multiplying these two values together calculates a term’s tf-idf, which is the frequency of a term adjusted for how rarely it is used. We’ll use tf-idf as a heuristic index to indicate how frequently a certain author uses a word relative to the frequency that ll the authors use the word. Therefore we will find words that are characteristic for a specific author, a good thing to have if we are interested in solving the author identification problem.

frequency <- count(spooky_wrd, author, word)
tf_idf    <- bind_tf_idf(frequency, word, author, n)
head(tf_idf)
## # A tibble: 6 x 6
##   author word        n        tf   idf    tf_idf
##   <fct>  <chr>   <int>     <dbl> <dbl>     <dbl>
## 1 EAP    à           9 0.000124  1.10  0.000136 
## 2 EAP    a.m         3 0.0000412 0.405 0.0000167
## 3 EAP    aaem        1 0.0000137 1.10  0.0000151
## 4 EAP    ab          1 0.0000137 1.10  0.0000151
## 5 EAP    aback       2 0.0000275 1.10  0.0000302
## 6 EAP    abandon     7 0.0000961 0     0
tail(tf_idf)
## # A tibble: 6 x 6
##   author word         n        tf   idf     tf_idf
##   <fct>  <chr>    <int>     <dbl> <dbl>      <dbl>
## 1 MWS    youth's      1 0.0000160 0.405 0.00000649
## 2 MWS    youthful    10 0.000160  0     0         
## 3 MWS    youths       2 0.0000320 0.405 0.0000130 
## 4 MWS    zaimi        2 0.0000320 1.10  0.0000352 
## 5 MWS    zeal         7 0.000112  0     0         
## 6 MWS    zest         3 0.0000480 0     0
tf_idf    <- arrange(tf_idf, desc(tf_idf))
tf_idf    <- mutate(tf_idf, word = factor(word, levels = rev(unique(word))))

# Grab the top thirty tf_idf scores in all the words 
tf_idf_30 <- top_n(tf_idf, 30, tf_idf)

ggplot(tf_idf_30) +
  geom_col(aes(word, tf_idf, fill = author)) +
  labs(x = NULL, y = "TF-IDF values") +
  theme(legend.position = "top", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

Note that in the above, many of the words recognized by their tf-idf scores are names. This makes sense – if we see text referencing Raymond, Idris, or Perdita, we know almost for sure that MWS is the author. But some non-names stand out. EAP often uses “monsieur” and “jupiter” while HPL uses the words “bearded” and “attic” more frequently than the others. We can also look at the most characteristic terms per author.

# Grab the top twenty tf_idf scores in all the words for each author
tf_idf <- ungroup(top_n(group_by(tf_idf, author), 20, tf_idf))
  
ggplot(tf_idf) +
  geom_col(aes(word, tf_idf, fill = author)) +
  labs(x = NULL, y = "tf-idf") +
  theme(legend.position = "none") +
  facet_wrap(~ author, ncol = 3, scales = "free") +
  coord_flip() +
  labs(y = "TF-IDF values")

6. Data Visualization

We’ll do some simple numerical summaries of the data to provide some nice visualizations.

p1 <- ggplot(spooky) +
      geom_bar(aes(author, fill = author)) +
      theme(legend.position = "none")


spooky$sen_length <- str_length(spooky$text)
head(spooky$sen_length)
## [1] 231  71 200 206 174 468
p2 <- ggplot(spooky) +
      geom_density_ridges(aes(sen_length, author, fill = author)) +
      scale_x_log10() +
      theme(legend.position = "none") +
      labs(x = "Sentence length [# characters]")


spooky_wrd$word_length <- str_length(spooky_wrd$word)
head(spooky_wrd$word_length)
## [1]  7  8  5 12 10  7
p3 <- ggplot(spooky_wrd) +
      geom_density(aes(word_length, fill = author), bw = 0.05, alpha = 0.3) +
      scale_x_log10() +
      theme(legend.position = "none") +
      labs(x = "Word length [# characters]")

layout <- matrix(c(1, 2, 1, 3), 2, 2, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)
## Loading required package: grid
## Picking joint bandwidth of 0.0414

Part II Further analysis – are they appropriate for children to read?

The Motion Picture Association of America (MPAA) film rating system is used in the United States and its territories to rate a film’s suitability for certain audiences based on its content. The MPAA rating system is one of various motion picture rating systems that are used to help parents decide what films are appropriate for their children.

We would like to apply the idea of movie rating to book classification to help parents and children choose age-appropriate books.

  1. Overview of Sentiment Analysis

When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. In this part, we will do some basic sentiment analysis.

1.1 Use NRC lexicon

get_sentiments('nrc')
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
sentiments <- inner_join(spooky_wrd, get_sentiments('nrc'), by = "word")

count(sentiments, sentiment)
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         9869
##  2 anticipation 11258
##  3 disgust       7697
##  4 fear         13927
##  5 joy          10190
##  6 negative     23674
##  7 positive     25175
##  8 sadness      12674
##  9 surprise      6199
## 10 trust        13655
count(sentiments, author, sentiment)
## # A tibble: 30 x 3
##    author sentiment        n
##    <fct>  <chr>        <int>
##  1 EAP    anger         2962
##  2 EAP    anticipation  3982
##  3 EAP    disgust       2261
##  4 EAP    fear          4194
##  5 EAP    joy           3302
##  6 EAP    negative      7659
##  7 EAP    positive      9291
##  8 EAP    sadness       3938
##  9 EAP    surprise      2244
## 10 EAP    trust         5116
## # ... with 20 more rows
ggplot(count(sentiments, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment))

ggplot(count(sentiments, author, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

From above plot, we found that not much of a positive mood in HPL’s works. MWS used positive words almost as much as negative ones. There’s much “trust” and “joy” to counteract all the “sadness”, “fear”, and “anger” in the world.

Then, we will seperate positive and negative words of each author to see more details.

  1. EAP
sentiments_EAP <- sentiments%>%
  filter(author == "EAP" & (sentiment == "positive" | sentiment == "negative")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_EAP,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Edgar Allan Poe - Sentiment analysis")

It is interesting that his most-used negative words are “words” and “doubt” - pointing to a different kind of horror of the imagination. His most-uesd positive word is “found”.This made me much more interested in his works. I can’t help wondering what I can find in his work.

  1. HPL
sentiments_HPL <- sentiments%>%
  filter(author == "HPL" & (sentiment == "positive" | sentiment == "negative")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_HPL,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("HP Lovecraft - Sentiment analysis")

We found that not much of a positive mood in HPL’s works. Similarly to EAP, the most-uesd positive word is “found”. We also found that there were more words related to color than other two authors. HPL used “black” represents something negative while “white” and “green” are used for positive. In my opinion, “white” could be related to clean and honest and “green” is probably a color with full of hope and love.

  1. MWS
sentiments_MWS <- sentiments%>%
  filter(author == "MWS" & (sentiment == "positive" | sentiment == "negative")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_MWS,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Mary Shelley - Sentiment analysis")

We found that Mary Shelley used positive words almost as much as negative ones. She used “love” a lot as a positive word. “Death” is clearly her most common negative word.

We have a question now. Will the result be different if we use other lexicons?

Let’s change from “nrc” to “bing” to get more information about sentiment analysis.

1.2 Use Bing Lexicon

get_sentiments('bing')
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
sentiments_bing <- inner_join(spooky_wrd, get_sentiments('bing'), by = "word")

count(sentiments_bing, sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative  22919
## 2 positive  13551
count(sentiments_bing, author, sentiment)
## # A tibble: 6 x 3
##   author sentiment     n
##   <fct>  <chr>     <int>
## 1 EAP    negative   7188
## 2 EAP    positive   4844
## 3 HPL    negative   7589
## 4 HPL    positive   2706
## 5 MWS    negative   8142
## 6 MWS    positive   6001
ggplot(count(sentiments_bing, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment))

ggplot(count(sentiments_bing, author, sentiment)) + 
  geom_col(aes(sentiment, n, fill = sentiment)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

After applying “bing” lexicon, we found that overall, these three authors used more negative words than positive words.

Then, we will also seperate positive and negative words of each author to see more details.

  1. EAP
sentiments_bing_EAP <- sentiments_bing%>%
  filter(author == "EAP") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_EAP,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Edgar Allan Poe - Sentiment analysis")

“doubt” and “death” became the most uesd negative words in EAP’s works. While the word “word” which is the most used negative word in NRC lexicon analysis is no loner existing. The words in positive part are also different from NRC lexicon analysis.

  1. HPL
sentiments_bing_HPL <- sentiments_bing%>%
  filter(author == "HPL") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_HPL,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("HP Lovecraft - Sentiment analysis")

There is not much positive mood in HPL’s works. The most uesd negative word and positive word are both different from NRC lexicon analysis. In this result, his most uesd negative word is “strange” which is somehow more reasonable than “ancient” in NRC.

  1. MWS
sentiments_bing_MWS <- sentiments_bing%>%
  filter(author == "MWS") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  group_by(sentiments_bing_MWS,sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative/positive sentiment", x = NULL) +
  coord_flip() +
  ggtitle("Mary Shelley - Sentiment analysis")

From above plot, we can find that she used “love” a lot as a positive word. “Death” is clearly her most common negative word.To my surprise, the most used negative and positive words are the same with the NRC lexicon analysis. She used positive words almost as much as negative ones.

After comparing the results generated by different lexicons, I decided to use “bing” lexicon to do further analysis. Next, we will only study “negative” words to find some clues about the theroy of book classification.

  1. Comparing negativity
bing_neg <- filter(get_sentiments('bing'), sentiment == "negative")
bing_neg
## # A tibble: 4,782 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 abnormal    negative 
##  4 abolish     negative 
##  5 abominable  negative 
##  6 abominably  negative 
##  7 abominate   negative 
##  8 abomination negative 
##  9 abort       negative 
## 10 aborted     negative 
## # ... with 4,772 more rows
negative <- inner_join(spooky_wrd, bing_neg, by = "word")
head(negative)
##        id author        word word_length sentiment
## 1 id26305    EAP     dungeon           7  negative
## 2 id17569    HPL     mistake           7  negative
## 3 id11008    EAP incessantly          11  negative
## 4 id12958    HPL   perplexed           9  negative
## 5 id12958    HPL      steals           6  negative
## 6 id22965    MWS     intense           7  negative
count(negative, word, sort = TRUE)
## # A tibble: 2,403 x 2
##    word        n
##    <chr>   <int>
##  1 death     380
##  2 strange   283
##  3 fear      240
##  4 dark      223
##  5 dead      203
##  6 lost      181
##  7 object    165
##  8 doubt     159
##  9 fell      157
## 10 wild      157
## # ... with 2,393 more rows

First, we plot a frequency comparison of these “negative” words.

neg_words     <- count(group_by(negative, word, author))
neg_words_all <- count(group_by(negative, word))

neg_words <- left_join(neg_words, neg_words_all, by = "word")
neg_words <- arrange(neg_words, desc(n.y))
neg_words <- ungroup(head(neg_words, 81))

ggplot(neg_words) +
  geom_col(aes(reorder(word, n.y, FUN = min), n.x, fill = author)) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~ author) +
  theme(legend.position = "none")

Then, we go a step further and assign a “negativity fraction” to each sentence; defined in the same way as the other index: # negative / (# negative + # positive). We plot the distribution of these negativity indeces for the three authors:

p1 <- ggplot(sentiments_bing, aes(author, fill = sentiment)) + geom_bar(position = "fill")

p2 <- group_by(sentiments_bing, author, id, sentiment) %>%
  count() %>%
  spread(sentiment, n, fill = 0) %>%
  group_by(author, id) %>%
  summarise(neg = sum(negative),
            pos = sum(positive)) %>%
  arrange(id) %>%
  mutate(frac_neg = neg/(neg + pos)) %>%
  ggplot(aes(frac_neg, fill = author)) +
  geom_density(bw = .2, alpha = 0.3) +
  theme(legend.position = "right") +
  labs(x = "Fraction of negative words per sentence")

layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

HPL’s works are more negative than EAP and MWS. From the above plots, we know clearly when the fraction of negative words per sentence is between 0 and 0.5, the probability of negative words of MWS is lager than EAP and HPL.

  1. Vilence level analysis

The Motion Picture Association of America (MPAA) film rating system has five components: Violence, Language, Substance abuse, Nudity and Sexual content. Considering the last 4 are more difficult to choose the standard for comparison, so here we only analyze the first one – Violence.

We first built a dictionary contains all the common violent words. The violent words come from http://www.thesaurus.com. For example: crazy, cruel, fierce.

Then, we counted these words appeared in entire dataset and in each author’s works.

#build a dictionary which contains all the common violence words. 
Vio <- data.frame(word=c('brutal','crazy','cruel','fierce','homicidal','hysterical','murderous','passionate','potent','powerful','savage','uncontrollable','vicious','agitated','aroused','berserk','bloodthirsty','coercive','demoniac','desperate','distraught','disturbed','enraged','fiery','forceful','forcible','frantic','fuming','furious','great','headstrong','hotheaded','impassioned','impetuous','inflamed','intemperate','mad','maddened','maniacal','mighty','raging','riotous','rough','strong','ungovernable','unrestrained','urgent','vehement','wild'),stringsAsFactors=FALSE)
head(Vio)
##         word
## 1     brutal
## 2      crazy
## 3      cruel
## 4     fierce
## 5  homicidal
## 6 hysterical
#count these words appeared in spooky data
vio_words <- inner_join(Vio, spooky_wrd, by = "word")
lapply(vio_words, function(x)which(is.na(x)))
## $word
## integer(0)
## 
## $id
## integer(0)
## 
## $author
## integer(0)
## 
## $word_length
## integer(0)
vio_words <- count(group_by(vio_words, word,author))
head(vio_words)
## # A tibble: 6 x 3
## # Groups: word, author [6]
##   word     author     n
##   <chr>    <fct>  <int>
## 1 agitated EAP       11
## 2 agitated HPL        1
## 3 agitated MWS       14
## 4 aroused  EAP       11
## 5 aroused  HPL       15
## 6 aroused  MWS        1
ggplot(vio_words) + 
  geom_col(aes(word,n,fill = word))

ggplot(vio_words) + 
  geom_col(aes(word, n, fill = word)) + 
  facet_wrap(~ author) +
  coord_flip() +
  theme(legend.position = "none")

We found that these violent terms are indeed distributed among the three authors’ works. Due to the lack of further information, we can not set a standard to justify whether there is too much violent mood in their works. Maybe in the future, we will have a baseline, for example, when these violent terms make up about 20% or more of the entire article, children need to read such books under the guidance of their parents.

  1. Topic Modeling

Topic modeling is a method for unsupervised classification of documents by themes, similar to clustering on numeric data. We’re trying to look through the content of each author’s works to identify whether they are appropriate for children to read by running Latent Dirichlet Allocation.

# Counts how many times each word appears in each sentence
sent_wrd_freqs <- count(spooky_wrd, id, word)
head(sent_wrd_freqs)
## # A tibble: 6 x 3
##   id      word         n
##   <chr>   <chr>    <int>
## 1 id00001 content      1
## 2 id00001 idris        1
## 3 id00001 mine         1
## 4 id00001 resolve      1
## 5 id00002 accursed     1
## 6 id00002 city         1
# Creates a DTM matrix
spooky_wrd_tm <- cast_dtm(sent_wrd_freqs, id, word, n)
spooky_wrd_tm
## <<DocumentTermMatrix (documents: 19467, terms: 24941)>>
## Non-/sparse entries: 193944/485332503
## Sparsity           : 100%
## Maximal term length: 19
## Weighting          : term frequency (tf)
length(unique(spooky_wrd$id))
## [1] 19467
length(unique(spooky_wrd$word))
## [1] 24941

The matrix spooky_wrd_tm is a sparse matrix with 19467 rows, corresponding to the 19467 ids (or originally, sentences) in the spooky_wrd dataframe, and 24941 columns corresponding to the total number of unique words in the spooky_wrd dataframe. So each row of spooky_wrd_tm corresponds to one of the original sentences. The value of the matrix at a certain position is then the number of occurences of that word (determined by the column) in this specific sentence (determined by the row). Since most sentence/word pairings don’t occur, the matrix is sparse meaning there are many zeros.

For LDA we must pick the number of possible topics. I tried 12, 10, 8 and 6. Finnally I chose 6.

spooky_wrd_lda    <- LDA(spooky_wrd_tm, k = 6, control = list(seed = 1234))
spooky_wrd_topics <- tidy(spooky_wrd_lda, matrix = "beta")
spooky_wrd_topics
## # A tibble: 149,646 x 3
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 content 0.000198 
##  2     2 content 0.000316 
##  3     3 content 0.000359 
##  4     4 content 0.0000223
##  5     5 content 0.0000294
##  6     6 content 0.000107 
##  7     1 idris   0.000499 
##  8     2 idris   0.000562 
##  9     3 idris   0.000467 
## 10     4 idris   0.000406 
## # ... with 149,636 more rows

We note that in the above we use the tidy function to extract the per-topic-per-word probabilities, called “beta” or \(\beta\), for the model. The final output has a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic.

# Grab the top five words for each topic.
spooky_wrd_topics_5 <- ungroup(top_n(group_by(spooky_wrd_topics, topic), 5, beta))
spooky_wrd_topics_5 <- arrange(spooky_wrd_topics_5, topic, -beta)
spooky_wrd_topics_5 <- mutate(spooky_wrd_topics_5, term = reorder(term, beta))

ggplot(spooky_wrd_topics_5) +
  geom_col(aes(term, beta, fill = factor(topic)), show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  coord_flip()

In the above, we found several words appear in many topics. For instance, topic 1,4,5,6 all contain the word “time”. “found” is also a word contained by topic 2,3 and 6. At this point, I seem a bit understated why “found” is a negative term. When “found” appears together with other words related to people’s body part like “eyes”,“head”,“heart” in a topic, it really makes people feel creepy. What’s more, there are also some words that described time, such as “night”, “dark” and also “time” itself which made me feel even more frightening.

Also, we see that these 6 topics are quite similar. I can hardly tell the difference between these topics. The only thing I can be sure of is that these words remind me of all the horror stories I might think of in my life.

Therefore, let’s study terms that have the greatest difference in probabilities between the topics, ignoring the words that are shared with similar frequency between topics. We choose only the first 3 topics as example and visualise the differences by plotting log ratios: \(log_{10}(\beta \text{ of topic x }/ \beta \text{ of topic y})\). So if a word is 10 times more frequent in topic x the log ratio will be 1, whereas it will be -1 if the word is 10 times more frequent in topic y.

spooky_wrd_topics <- mutate(spooky_wrd_topics, topic = paste0("topic", topic))
spooky_wrd_topics <- spread(spooky_wrd_topics, topic, beta)

spooky_wrd_topics_12 <- filter(spooky_wrd_topics, topic1 > .001 | topic2 > .001)
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, log_ratio = log10(topic2 / topic1))
spooky_wrd_topics_12 <- group_by(spooky_wrd_topics_12, direction = log_ratio > 0)
spooky_wrd_topics_12 <- ungroup(top_n(spooky_wrd_topics_12, 5, abs(log_ratio)))
spooky_wrd_topics_12 <- mutate(spooky_wrd_topics_12, term = reorder(term, log_ratio))

p1 <- ggplot(spooky_wrd_topics_12) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 2 / topic 1") +
      coord_flip()


spooky_wrd_topics_23 <- filter(spooky_wrd_topics, topic2 > .001 | topic3 > .001)
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, log_ratio = log10(topic3 / topic2))
spooky_wrd_topics_23 <- group_by(spooky_wrd_topics_23, direction = log_ratio > 0)
spooky_wrd_topics_23 <- ungroup(top_n(spooky_wrd_topics_23, 5, abs(log_ratio)))
spooky_wrd_topics_23 <- mutate(spooky_wrd_topics_23, term = reorder(term, log_ratio))

p2 <- ggplot(spooky_wrd_topics_23) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 3 / topic 2") +
      coord_flip()

spooky_wrd_topics_13 <- filter(spooky_wrd_topics, topic3 > .001 | topic1 > .001)
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, log_ratio = log10(topic3 / topic1))
spooky_wrd_topics_13 <- group_by(spooky_wrd_topics_13, direction = log_ratio > 0)
spooky_wrd_topics_13 <- ungroup(top_n(spooky_wrd_topics_13, 5, abs(log_ratio)))
spooky_wrd_topics_13 <- mutate(spooky_wrd_topics_13, term = reorder(term, log_ratio))

p3 <- ggplot(spooky_wrd_topics_13) +
      geom_col(aes(term, log_ratio, fill = log_ratio > 0)) +
      theme(legend.position = "none") +
      labs(y = "Log ratio of beta in topic 3 / topic 1") +
      coord_flip()


layout <- matrix(c(1,2,3), 3, 1, byrow = TRUE)
multiplot(p1, p2, p3, layout = layout)

In the above, I guess topic 1 may occur in the wild because this topic contains words like “moon”, “sky” and “air” which are more popular than other two. Topic 2 may happen at a late night party. The story may have taken place in a study room full of books, all of them standing together.

Think further

Due to the lack of appropriate judgments of children, the film classification system has largely avoided the children’s exposure to violence, drugs and other works that are likely to have adverse effects on them. Similarly, in the literary world, similar hierarchies are needed to help children grow healthily. In this project, I didn’t find a way to set up a standard to justify whether there is too much violent mood in their works. I would like to explore more in this area and hope in the future, we will have a baseline, for example, when these violent terms make up about 20% or more of the entire article, children need to read such books under the guidance of their parents.